03. A3C

A3C Algorithm

Recent autonomous mobile navigation systems leveraging Deep RL, such as the one described in the Mirowski paper, use the A3C algorithm. A3C, which stands for “Asynchronous Advantage Actor-Critic,” is faster and more robust than DQN and has essentially replaced it in many applications. In the following videos, Arpan provides a visual overview of how actor-critic algorithms work in general. A3C builds on actor-critic with two key innovations: multiple “asynchronous” workers and an “advantage” scoring function.

Actor-Critic

Video: Actor-Critic Methods

Video: The Actor and the Critic
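
For readers who prefer code to diagrams, here is a minimal actor-critic sketch (assuming PyTorch, which the videos do not prescribe): a shared network body feeds two heads, where the “actor” head outputs action probabilities (the policy) and the “critic” head outputs a scalar estimate of the state value V(s).

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body with separate actor (policy) and critic (value) heads."""
    def __init__(self, state_size, action_size, hidden_size=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_size, hidden_size), nn.ReLU())
        self.actor = nn.Linear(hidden_size, action_size)  # policy logits
        self.critic = nn.Linear(hidden_size, 1)           # state-value estimate V(s)

    def forward(self, state):
        x = self.body(state)
        action_probs = torch.softmax(self.actor(x), dim=-1)  # pi(a|s)
        state_value = self.critic(x)                          # V(s)
        return action_probs, state_value
```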

Advantage

We’ve seen that the value function estimate is used as a scoring function by the “critic” to evaluate the “actor” policy function. As a further refinement, the critic can instead estimate and use the advantage function as its scoring function. The advantage function measures how much better one action is than the others available in a given state, according to the following equation:

A(s, a) = Q(s, a) - V(s)

A Q-value tells us how much reward we expect to get by taking action a in state s, whereas the advantage value, A(s, a), tells us how much more reward we can expect beyond the expected value of the state, V(s). Put another way, it answers the question: “What are we gaining by taking this action instead of an action at random?”
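
As an illustrative sketch (not the paper’s exact estimator), Q(s, a) is often approximated by the observed reward plus the discounted value of the next state, so the critic can compute an advantage from value estimates alone. The function and variable names below are hypothetical.

```python
def estimate_advantage(reward, value_s, value_next_s, gamma=0.99, done=False):
    """One-step advantage estimate: A(s, a) ~= [r + gamma * V(s')] - V(s)."""
    # Approximate Q(s, a) with the observed reward plus the discounted
    # value of the next state; bootstrap with 0 if the episode ended.
    q_estimate = reward + (0.0 if done else gamma * value_next_s)
    # How much better was this action than the state's expected value?
    return q_estimate - value_s

# Example: reward = 1.0, V(s) = 0.5, V(s') = 0.6
# advantage = 1.0 + 0.99 * 0.6 - 0.5 = 1.094
print(estimate_advantage(1.0, 0.5, 0.6))
```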

Asynchronous

“Asynchronous” refers to the idea that, in A3C, multiple networks try to solve the problem in parallel. These “worker” agents explore different parts of the environment simultaneously, so learning is faster and the combined training experience is more diverse.
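
The sketch below illustrates only that structure: several worker threads, each with its own copy of the environment, gather experience in parallel and push updates to a single shared network. `make_env`, `global_model`, and `apply_update` are hypothetical placeholders, not part of any real library.

```python
import threading

def worker(worker_id, global_model, make_env, num_steps=1000):
    env = make_env()  # hypothetical: each worker explores its own environment copy
    state = env.reset()
    for _ in range(num_steps):
        action = global_model.act(state)           # act with the shared policy
        state, reward, done = env.step(action)
        apply_update(global_model, state, reward)  # hypothetical asynchronous update
        if done:
            state = env.reset()

def train_a3c(global_model, make_env, num_workers=4):
    # Launch several workers that learn in parallel against the same shared model.
    threads = [threading.Thread(target=worker, args=(i, global_model, make_env))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```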